Conversation

Qubitium commented Oct 13, 2025

On first launch, a Triton kernel calls `_bench` and related code to cache the best config for subsequent kernel launches. This part of the code is racy and will crash under GIL=0 (Python >= 3.13T). The crash manifests as cryptic `'NoneType' object has no attribute ...` errors to the end user, when in reality it is a data-race issue.
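The failure mode is the familiar unsynchronized check-then-populate pattern; here is a minimal sketch of the shape of the problem (an illustration only, not Triton's actual autotuner code):

```python
# Simplified illustration of the race, NOT the actual Autotuner code.
class RacyAutotuner:
    def __init__(self):
        self.cache = {}  # shared across threads, unprotected

    def run(self, key, configs, bench):
        # Classic check-then-populate: with GIL=0, two threads can both miss,
        # both benchmark, and interleave writes to shared autotuner state,
        # surfacing as confusing attribute errors rather than a clean failure.
        if key not in self.cache:
            best = min(configs, key=bench)
            self.cache[key] = best
        return self.cache[key]
```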

I encountered this while doing GIL=0 thread-based module-parallel execution across multiple GPUs in GPT-QModel, a quantization toolkit that ships Triton kernels.

EDIT: Please ignore the following block. The PR has been updated to use a global lock plus per-key futures to synchronize multi-threaded autotuning.

------ Outdated ------

There were two possible protections: locks or thread-local storage. I implemented thread-local for the following reasons:

  1. Latency: thread-local (no locks) is much faster for cache retrieval than a lock-protected global cache.
  2. Locks are messy, and thread-local is actually easier to implement with less code.
  3. Thread-local has an obvious downside: no persistence across threads and duplication of Triton configs, plus extra autotune runs per GPU thread. I consider this a small downside with clear upsides.

Upsides:

  1. If you use Python GIL=0 and care about performance, you will likely use thread pools rather than launching a new thread every time. This nullifies the persistence downside in real-world applications. Launching a new thread for every kernel execution is bad practice (thread startup is far slower than any code or lock overhead by a wide margin), and we should not spend too much effort optimizing for bad practices.

  2. If you want to further optimize your threading code, you will likely use persistent threads and bind each thread to a specific cuda:index. With a global cache, Triton would assume all cuda:index devices are the same GPU. That is not the case for my setup, and we shouldn't make this assumption for others. Frankly, the benchmark/config cache should be keyed per cuda:index (unique GPU fingerprint), but that is a topic for another PR. With thread-local storage we effectively get an optimized kernel launch per cuda:index (potentially a unique GPU), since the developer should already persist a thread in a pool and bind it to a cuda:index.

As such, I believe thread-local is a better option than a global cache (with locks): more scalable, lower latency, and indirectly more accurate. I'm unsure how much memory the duplicated configs would cost, but I expect it to be minimal. A minimal sketch of this approach is shown below.
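For illustration only, a sketch of the (now abandoned) thread-local variant; `get_best_config` is a hypothetical helper, not Triton's API:

```python
# Sketch of the abandoned thread-local approach (illustration only).
import threading

_tls = threading.local()

def get_best_config(key, configs, bench):
    # Each thread keeps its own config cache, so no lock is needed,
    # at the cost of re-running autotuning once per thread.
    cache = getattr(_tls, "cache", None)
    if cache is None:
        cache = _tls.cache = {}
    if key not in cache:
        cache[key] = min(configs, key=bench)  # per-thread benchmark
    return cache[key]
```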

A new test_autotuner_thread_safety unit test has been added to the existing python/test/unit/runtime/test_autotuner.py test file. The test should run in a GIL=0 environment; Triton CI should set up a Python 3.14T runner for it. A sketch of what such a test can look like is shown below.
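For illustration, a concurrency test along these lines exercises the fix. This is a sketch only; `_add_kernel` and `_launch` are hypothetical names and this is not the exact test added in the PR:

```python
# Sketch only: not the exact test added in the PR.
from concurrent.futures import ThreadPoolExecutor

import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
    ],
    key=["n_elements"],
)
@triton.jit
def _add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)


def _launch(n=4096):
    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    out = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    _add_kernel[grid](x, y, out, n)
    torch.testing.assert_close(out, x + y)


def test_autotuner_thread_safety():
    # On the first launches, many threads race into autotuning of the same
    # kernel/key; without synchronization this crashes under GIL=0.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(lambda _: _launch(), range(32)))
```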

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • Add new test_autotuner_thread_safety inside existing python/test/unit/runtime/test_autotuner.py
  • Select one of the following.

    • I have not added any lit tests.

@Qubitium Qubitium requested a review from ptillet as a code owner October 13, 2025 14:58
@lezcano lezcano requested a review from peterbell10 October 14, 2025 07:47
Jokeren (Contributor) commented Oct 14, 2025

I probably didn't get the use of thread local here. Doesn't it mean that we need to autotune T times if there are T concurrent threads?

Jokeren (Contributor) commented Oct 14, 2025

Also, if two threads happen to use the same GPU, the measured performance won't match the original case where a single thread uses the GPU, since kernels from each thread will be launched and executed in an interleaved way.

Qubitium (Author) commented Oct 14, 2025

> Also, if two threads happen to use the same GPU, the measured performance won't match the original case where a single thread uses the GPU, since kernels from each thread will be launched and executed in an interleaved way.

Having a global cache or lock doesn't help that situation. Suppose two threads are bound to gpu:0: one is already running pre-tuned kernel work while the second, also bound to gpu:0, enters autotuning for the first time on a different kernel. I don't think we can control this. It is up to the end user to launch new kernels serially and hold exclusive locks per gpu:index to make sure the autotune benchmarks are consistent and accurate.

Qubitium (Author) commented Oct 14, 2025

> I probably didn't get the use of thread local here. Doesn't it mean that we need to autotune T times if there are T concurrent threads?

Yes, this is the downside of thread-local. For most cases, I think end users will launch a thread and have it persist for the lifetime of the model execution instead of launching a fresh thread each time they need to execute a kernel; the cost of launching threads and context switching is quite high. Maybe my view of how threads should be used differs from reality, but in my view users should not be tearing GPU-bound execution threads up and down, just as you don't repeatedly launch and tear down GPU processes: you keep them alive for the duration of the program. A sketch of that persistent per-device worker pattern is below.
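To make that concrete, here is a minimal sketch of the usage pattern I have in mind (my own illustration, not code from this PR): persistent workers, each pinned to one cuda:index.

```python
# Sketch of persistent per-GPU worker threads (illustration only).
import queue
import threading

import torch


def _worker(device_index, jobs):
    # Bind this thread to one GPU; CUDA's current device is per-thread.
    torch.cuda.set_device(device_index)
    while True:
        job = jobs.get()
        if job is None:  # shutdown sentinel
            break
        job()  # e.g. a Triton kernel launch; autotuned once per worker lifetime


queues = []
for idx in range(torch.cuda.device_count()):
    q = queue.Queue()
    threading.Thread(target=_worker, args=(idx, q), daemon=True).start()
    queues.append(q)

# Submit work to the thread bound to cuda:0; the thread (and its autotune
# results) lives for the duration of the program.
queues[0].put(lambda: None)  # placeholder job
```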

Qubitium (Author) commented

@Jokeren I think global cache and thread-local cache both have advantages and disadvantages. In my world (GIL=0), thread-local makes more sense, but since this PR is for everyone, I don't mind switching back to a global cache with a threading.Lock if more users would benefit from the global-cache use case.

Jokeren (Contributor) commented Oct 14, 2025

I believe that a thread-local implementation may introduce additional inconsistencies. For instance, different threads could end up selecting different "best" kernels due to subtle measurement nuances.

Qubitium (Author) commented Oct 15, 2025

> I believe that a thread-local implementation may introduce additional inconsistencies. For instance, different threads could end up selecting different "best" kernels due to subtle measurement nuances.

@Jokeren Please re-review. Here are the changes since your last review:

  1. Global cache (reverted thread-local).
  2. New global cache lock, _cache_lock (reentrant). cache has been renamed to _cache since it is lock-protected and dangerous to touch directly, making it private.
  3. To allow multi-threaded concurrent autotuning of the same or different kernels, a CacheFuture keyed by the autotuner key is created and stored in _cache_futures.
  4. When a thread enters autotuning and the cache lookup misses, it creates a CacheFuture for that key as a sync mechanism to block other threads that come in with the same kernel/key. The futures dict is protected by the same _cache_lock (see the sketch after this list).
  5. The disk cache appears safe, since the cache file is also keyed internally (different files for different keys) and is now protected by the same per-key futures.
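A minimal sketch of the lock-plus-per-key-future pattern described above; this shows the shape of the approach, not the PR's exact code, and `get_best_config`/`tune` are hypothetical names:

```python
# Sketch of the global-lock + per-key-future synchronization (illustration only).
import threading


class CacheFuture:
    def __init__(self):
        self._done = threading.Event()
        self.result = None

    def set(self, value):
        self.result = value
        self._done.set()

    def wait(self):
        self._done.wait()
        return self.result


_cache = {}
_cache_futures = {}
_cache_lock = threading.RLock()  # reentrant, guards both dicts


def get_best_config(key, tune):
    with _cache_lock:
        if key in _cache:                 # fast path: already tuned
            return _cache[key]
        future = _cache_futures.get(key)
        if future is None:                # first thread for this key tunes it
            future = _cache_futures[key] = CacheFuture()
            owner = True
        else:                             # later threads wait on the future
            owner = False
    if owner:
        best = tune()                     # benchmark outside the lock
        with _cache_lock:
            _cache[key] = best
            _cache_futures.pop(key, None)
        future.set(best)
        return best
    return future.wait()
```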

Jokeren (Contributor) commented Oct 15, 2025

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Jokeren (Contributor) commented Oct 16, 2025

Let's keep the PR open until we have figured out more complete solutions for GIL=0.

Qubitium (Author) commented Oct 17, 2025

> Let's keep the PR open until we have figured out more complete solutions for GIL=0.

Yes. This PR only fixes the autotuner and does not address the wider issues that may exist in other parts of Triton. Are there plans, short or long term, to address those outside the scope of this PR? It would be good to piecemeal the nogil compatibility work: there are lots of APIs, including Python stdlib ones that are not safe under nogil (like gc.collect), and it would be hard to fix everything in a single PR.

Jokeren (Contributor) commented Oct 17, 2025

> Are there plans, short or long term, to address those outside the scope of this PR?

Not yet. So I think we don't plan to merge this PR in the short term. cc @ThomasRaoux

Qubitium (Author) commented

> Are there plans, short or long term, to address those outside the scope of this PR?

> Not yet. So I think we don't plan to merge this PR in the short term. cc @ThomasRaoux

Ok. I am hoping this high-impact, low-risk, low-hanging-fruit PR will get consideration before a full thread-safe Triton roadmap. I am also willing to spend more time on this PR to address any lingering concerns about accuracy, safety, or execution-pattern deviation from main under GIL and no-GIL.
